Slurm job management

This page is a practical command reference for managing jobs once you know the basics from the Slurm quick guide.

For partition defaults and limits, see Advanced partitions.

Job lifecycle

1) Prepare a batch script

Use the template provided in your home directory:

cp ~/slurm-prod10.sbatch ./job.sbatch
nano job.sbatch

At minimum, set:

  • partition (#SBATCH --partition=...)
  • walltime (#SBATCH --time=...)
  • the Python command to run
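As a reference point, a minimal batch script might look like the sketch below; the partition, walltime, and `train.py` entry point are placeholders to adapt to your own job:

```shell
#!/bin/bash
#SBATCH --partition=prod10        # target partition (placeholder)
#SBATCH --time=01:00:00           # walltime limit, HH:MM:SS (placeholder)
#SBATCH --output=slurm-%j.out     # %j expands to the job ID

# train.py stands in for your own Python command
python train.py
```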

2) Submit the job

sbatch job.sbatch

You will get a job ID, for example:

Submitted batch job 29509

3) Monitor queue and status

squeue -u $USER
scontrol show job <jobid>

Useful job states:

  • PD: pending
  • R: running
  • CG: completing
  • CD: completed
  • F: failed
  • TO: timed out
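These state codes can also be used to filter the queue with squeue's standard `-t/--states` option, for example:

```shell
# Show only your pending jobs
squeue -u $USER -t PD

# Show only your running jobs
squeue -u $USER -t R
```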

4) Check accounting/history

sacct -j <jobid> --format=JobID,JobName,Partition,State,Elapsed,ExitCode
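To review history over a period rather than a single job, sacct also accepts a user and a start time (a sketch; adjust the date to your needs):

```shell
# All of your jobs since the given date
sacct -u $USER --starttime=2024-01-01 \
      --format=JobID,JobName,Partition,State,Elapsed,ExitCode
```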

5) Read logs

By default, output goes to slurm-<jobid>.out (or your custom --output/--error paths).

ls -lh slurm-<jobid>.out
tail -n 100 slurm-<jobid>.out

6) Cancel a job

scancel <jobid>
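scancel also accepts filters, which is handy for clearing many jobs at once:

```shell
# Cancel all of your jobs
scancel -u $USER

# Cancel only your pending jobs
scancel -u $USER -t PENDING

# Cancel a single task of a job array
scancel <jobid>_<taskid>
```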

Fairshare and priority

When resources are busy, job priority is determined by fairshare usage and other scheduler factors. Inspect them with:

sshare -l
sprio

Advanced patterns

Job arrays

Use arrays for many independent runs:

sbatch --array=0-31 job.sbatch      # tasks 0 through 31
sbatch --array=1,3,5,7 job.sbatch   # an explicit list of task IDs
sbatch --array=1-7:2 job.sbatch     # 1 to 7 in steps of 2 (1,3,5,7)

Inside the script, read the SLURM_ARRAY_TASK_ID environment variable to select each task's work item.
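As a sketch, each array task can pick its own input from its task ID (the data paths and `process.py` script here are hypothetical):

```shell
#!/bin/bash
#SBATCH --partition=prod10
#SBATCH --time=00:30:00
#SBATCH --array=0-31

# Slurm sets SLURM_ARRAY_TASK_ID to this task's index (0..31 here)
INPUT="data/sample_${SLURM_ARRAY_TASK_ID}.csv"
python process.py "$INPUT"
```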

Job dependencies (chaining)

sbatch step1.sbatch
sbatch --dependency=afterok:74698 step2.sbatch
sbatch --dependency=afterok:74698:74699 step3.sbatch

Common rules:

  • after:<jobid> — start once the listed job has started
  • afterany:<jobid> — start once it has terminated, whatever its final state
  • afterok:<jobid> — start only if it completed successfully
  • afternotok:<jobid> — start only if it failed
  • singleton — run at most one job at a time with the same job name and user

If a dependency can never be satisfied, cancel the stuck jobs with scancel.
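Rather than copying job IDs by hand, `sbatch --parsable` prints just the ID, so a chain can be scripted:

```shell
# Submit step 1 and capture its job ID
JOB1=$(sbatch --parsable step1.sbatch)

# Step 2 starts only if step 1 completed successfully
JOB2=$(sbatch --parsable --dependency=afterok:"$JOB1" step2.sbatch)

# Step 3 waits for both
sbatch --dependency=afterok:"$JOB1":"$JOB2" step3.sbatch
```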

Notes specific to this DGX

  • With QoS normal, you can run up to 4 jobs at the same time.
  • With QoS normal, only 2 running jobs are allowed across prod40 + prod80.
  • Specifying a partition is required for every submission.
  • prod10, prod40, prod80 are batch-oriented (use sbatch).
  • Use interactive10 with srun for interactive GPU debugging.
  • If you need more resources (or another QoS policy), contact support: dgx_support@listes.centralesupelec.fr.
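For the interactive case, a typical invocation looks like this (the GPU count and session length are examples to adapt):

```shell
# One GPU, interactive shell, 1-hour limit on the interactive partition
srun --partition=interactive10 --gres=gpu:1 --time=01:00:00 --pty bash
```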